Neural Rewards Regression for near-optimal policy identification in Markovian and partial observable environments

Authors

  • Daniel Schneegaß
  • Steffen Udluft
  • Thomas Martinetz
Abstract

Neural Rewards Regression (NRR) is a generalisation of Temporal Difference Learning (TD-Learning) and Approximate Q-Iteration with Neural Networks. The method allows trading between these two techniques, as well as between approaching the fixed point of the Bellman iteration and minimising the Bellman residual. NRR explicitly finds a near-optimal Q-function using no algorithmic framework other than back-propagation for Neural Networks. We further extend the approach with a recurrent substructure, yielding Recurrent Neural Rewards Regression (RNRR), for partially observable environments and higher-order Markov Decision Processes. It allows past information to be transported to the present and the future in order to reconstruct the Markov property.
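
As a purely illustrative sketch (not the authors' implementation), the following Python/PyTorch fragment shows one way the reward-regression idea can be set up: a single Q-network is trained so that Q(s, a) - gamma * max_a' Q(s', a') regresses the observed immediate reward, and a hypothetical blending factor `beta` controls how much gradient flows through the successor-state branch, moving between a TD-style fixed-point update (beta = 0) and full Bellman-residual minimisation (beta = 1). All names and hyper-parameters here are assumptions for illustration only.

```python
# Hedged sketch of the Neural Rewards Regression idea (illustrative, not from the paper).
# A shared Q-network predicts Q(s, a); the quantity Q(s, a) - gamma * max_a' Q(s', a')
# is regressed onto the observed immediate reward r via plain back-propagation.

import torch
import torch.nn as nn

class QNet(nn.Module):
    def __init__(self, state_dim, n_actions, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, s):
        return self.net(s)  # Q-values for all actions

def nrr_loss(qnet, s, a, r, s_next, gamma=0.9, beta=0.0):
    # Q(s, a) for the chosen actions
    q_sa = qnet(s).gather(1, a.unsqueeze(1)).squeeze(1)
    # greedy value of the successor state
    q_next = qnet(s_next).max(dim=1).values
    # beta blends between a detached (fixed-point / TD-like) target
    # and a fully differentiable one (Bellman-residual minimisation).
    q_next_blend = beta * q_next + (1.0 - beta) * q_next.detach()
    pred_reward = q_sa - gamma * q_next_blend
    return ((pred_reward - r) ** 2).mean()

# Usage sketch on a random batch of transitions (s, a, r, s').
qnet = QNet(state_dim=4, n_actions=2)
opt = torch.optim.Adam(qnet.parameters(), lr=1e-3)
s = torch.randn(32, 4); a = torch.randint(0, 2, (32,))
r = torch.randn(32); s_next = torch.randn(32, 4)
loss = nrr_loss(qnet, s, a, r, s_next, beta=0.5)
opt.zero_grad(); loss.backward(); opt.step()
```

With beta = 0 the successor term is treated as a constant target, as in TD-learning; intermediate values interpolate towards minimising the full Bellman residual.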

Similar resources

Explicit Kernel Rewards Regression for data-efficient near-optimal policy identification

We present the Explicit Kernel Rewards Regression (EKRR) approach, as an extension of Kernel Rewards Regression (KRR), for Optimal Policy Identification in Reinforcement Learning. The method uses the Structural Risk Minimisation paradigm to achieve a high generalisation capability. This explicit version of KRR offers at least two important advantages. On the one hand, finding a near-optimal pol...

Full text

Improving Optimality of Neural Rewards Regression for Data-Efficient Batch Near-Optimal Policy Identification

In this paper we present two substantial extensions of Neural Rewards Regression (NRR) [1]. In order to give a less biased estimator of the Bellman Residual and to facilitate the regression character of NRR, we incorporate an improved, Auxiliared Bellman Residual [2] and provide, to the best of our knowledge, the first Neural Network based implementation of the novel Bellman Residual minimisati...

Full text

Non-Markovian Control with Gated End-to-End Memory Policy Networks

Partially observable environments present an important open challenge in the domain of sequential control learning with delayed rewards. Despite numerous attempts during the last two decades, the majority of reinforcement learning algorithms and associated approximate models applied in this context still assume Markovian state transitions. In this paper, we explore the use of a recently propo...

Full text

Solving Problems in Partially Observable Environments with Classifier Systems (Experiments on Adding Memory to XCS)

XCS is a classifier system recently introduced by Wilson that differs from Holland's framework in that classifier fitness is based on the accuracy of the prediction instead of the prediction itself. According to the original proposal, XCS has no internal message list as traditional classifier systems do; hence XCS learns only reactive input/output mappings that are optimal in Markovian environment...

Full text

Agent Neighbourhood for Learning Approximated Policies in DEC-MDP

Resolving multiagent team decision problems, where agents share a common goal, is challenging since the number of states and joint actions grows exponentially with the number of agents. Even if the resolution of such problems is theoretically possible via models such as DEC-MDP, it is often intractable. In this context, it is important to find a good approximated policy without high resolution compl...

Full text


Journal:

Volume   Issue

Pages  -

Publication date  2007